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Abstract 

We provide a complete description of possible covariance matrices consistent with a Gaussian latent tree 
model for any tree. We then present techniques for utilising these constraints to assess whether observed data 
is compatible with that Gaussian latent tree model. Our method does not require us first to fit such a tree. 
We demonstrate the usefulness of the inverse-Wishart distribution for performing preliminary assessments of 
tree-compatibility using semialgebraic constraints. Using results from |Drton et al.| < [2008| > we then provide the 
appropriate moments required for test statistics for assessing adherence to these equality constraints. These are 
shown to be effective even for small sample sizes and can be easily adjusted to test either the entire model or 
only certain macrostructures hypothesized within the tree. We illustrate our exploratory tetrad analysis using a 
linguistic application and our confirmatory tetrad analysis using a biological application. 

Keywords — Gaussian; latent tree model; tetrad analysis; tree constraint; tree quartets. 


1 Introduction 


Modelling with hidden variables is commonly performed within the framework of graphical models (Lauritzen 
1996] jKoller & Friedman] |2009| l. When the observed variables are the leaves of a tree and the unobserved 
variables are interior nodes then the model is said to be a latent tree model (Choi et al. 2011 Wang et al. 2008). 


These models are used across disciplines including sociology, biology, and linguistics (Eisenstein et al. ] 
|Mourad et al.| |2013[ Zwiernik 


popular choice (Lawrence 2004). 


2010 ] 


2016]). In the case of continuous data Gaussian latent tree models became a 


Standard latent tree model selection techniques often assume a priori that the data generating process is 
driven by some latent tree model. So in particular the appropriateness of any given tree model is often not 
assessed in absolute terms but only relative to other tree models. Knowing whether any latent tree model could 
adequately explain what is observed is pertinent in phylogenetic settings where, for example, the effect of a 
possible horizontal gene transfer (e.g. Hao & Golding| ( 2008} )) makes any underlying latent tree model hypothesis 
a dubious one. 

By characterising the covariance space related to Gaussian latent tree models, we can better assess the suit¬ 
ability of trees or the fit of a particular tree for a data set. In this paper we present the complete description of 
this model class by relating this to the space of phylogenetic oranges (Engstrom et al.| 2012| Gill et al!] 2008; 
Kim 2000 Moulton & Steel, 2004). Such a complete description had been known for a simple tree with only 


* n. 1. shiers @ Warwick, ac .uk 
t piotr.zwiernik@gmail.com 
% j .aston@ statslab.cam.ac.uk 
§j. q.smith@warwick.ac.uk 


1 






















































four leaves (see |Pearl & Xu| ( fl987| Theorem 2)) or for a star tree ( see|Bekker & de Leeuw| ( fT987| l). For a general 
tree, only the defining equations have been derived; see Sullivant (2008 Corollary 6.5). 

Our method will use the description of Gaussian latent tree models in two scenarios. In the first setting 
we are interested in whether any latent tree model is a possible explanation for a given data set. In the second 
situation we fix a latent tree model. In both situations the alternative hypothesis is given by the saturated model. 
We illustrate these methods in Section |5~4| where we test a previously hypothesized phylogenetic tree applied to 
certain yeast species. In both examples it is contentious whether the class of phylogenetic trees is appropriate. 
This is the question that we are able to address directly in our analyses and something that can be done without 
first fitting the model. 

Let Z = (Zy) ve ij be a random vector whose components are indexed by the vertices of an undirected tree 
T = ( U,E ) with edge set E C U x U. The tree T induces a Gaussian tree model N(T) for Z, which is a 
Gaussian graphical model on T (Lauritzen 1996 Section 5.2). For any two nodes u,v £ U, let ph(in;) denote 
the set of edges on the unique path between u and v in this tree. Then the model N(T) is the collection of all 
multivariate normal distributions on for which Z„ and Z v are conditionally independent given a subvector 
Zc whenever the set C C U\ {u, i>} contains a node on ph(itu). Note that for three nodes u,v,w £ U the 
conditional independence of Z v and Z w given Z„ is equivalent to p vw = p uv p U w It follows that a normal 
distribution with correlation matrix R = (p uv ) belongs to N(T) if and only if p uv = rieeph(uu) P e f° r 
u, v £ U, where p e = p uv when e is the edge (u, v). 

In this paper we study Gaussian latent tree models where only the observed random variables correspond to 
the tree’s leaves. We henceforth denote the set of leaves of this tree by V. A typical such evolutionary tree, one 
of Romance languages, is displayed below in Fig. [T] where the observable, extant languages are represented as 
its leaves. 


Portuguese Iberian Spanish American Spanish 



Figure 1: Quintet tree X 5 relating five Romance languages. 


Definition 1. The Gaussian latent tree model M (T)for the subvector X = {Z v ) v ^.y is the set of all V -marginal 
distributions of the distributions in N(T), where the V-marginal distributions are those associated with leaf 
variables. 

The parameterization of M (T) is induced from the parameterization of N(T) and given by 

Pij = | Pe ( 1 ) 

eeph(ij) 

for all i,j £ V. As the variances a uu for u £ U \ V never appear in this parameterization, without loss of 
generality, we can assume they are equal to 1 . 

2 Semialgebraic description of the latent tree model 

2.1 Tree metrics and phylogenetic oranges 

Let T = (U, E) be a tree with leaf set V C [J. Associate to each edge a non-negative number d e , which we 
interpret as the length of this edge. Then for any two leaves i,j £ V we can compute the distance between 
them as dij = Seephfo) ^e• It is eas y to check that a collection of such distances for all pairs u, v £ V forms a 
metric. The set of all metrics that arise in this way for all T with leaves labelled by V is called the space of tree 
metrics. We recall the following result. 
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Theorem 1 (Buneman](jT974)). A collection of positive numbers dij for i,j £ V forms a tree metric if and only 
if for all (not necessarily distinct) i, j, k 7 l £ V we have 


ma x(d ik + dji,du + d jk ) > d i3 + d k i- 

Equivalently, for any three sums dik + dji, du + dj k , dij+d k i two are equal and not less than the third. Moreover, 
if the above inequalities hold, then generically T is uniquely identified. 

In the above theorem the term generically means that the statement holds outside a set of measure zero corre¬ 
sponding to the vanishing of some edge lengths d e . We note that a more precise statement is also possible if we 
allow semi-labelled trees, see |Semple & Steel| ( |2003| Section 7). A careful analysis shows that this generic tree 
is always a binary tree, i.e. a tree with all its inner nodes of degree three. The usual triangle inequality follows 
from setting i.j, k distinct and k = l in Theorem[I| which in turn implies that every tree metric is a metric on V. 

Corollary 1. The space of tree metrics on a fixed tree T is given as a set of all metrics on V satisfying: for any 
four distinct leaves i,j, k, l such that ph(*, j) (~l ph(fc, l) = 0, we have 

di k + dji = du + dj k > dij 4- d k i- 

We emphasize that ph(z, j) is the set of edges and hence, for example, for a star tree any four leaves i, j, k. I 
that satisfy ph (i,j) n ph(fc, l) = 0. The condition ph(i, j) D ph(fc, l) = 0 implies that the induced subtree over 
i. j , k. I , that is, the smallest connected subgraph of T containing i.j. k, l, looks like a quartet tree in Figure[2] 
This also explains the conditions of Corollary [I] 



Figure 2: A quartet tree ij\kl 


Another closely related space defined over a tree is the space of phylogenetic oranges; see e.g. jKirn](2()0()K 
Moulton & Steel' (2004 1. For a fixed tree T this is given by the same parameterization o as the Gaussian latent 
tree model but where in addition the edge correlations p e are non-negative. The set of all points in PA 

that arise in this way is denoted by PO(T) and it forms a toric cube as defined in Engstrom et al. (2012]l. The 
union of all PO(T) is denoted by PO(V). 

Denote by l ’0 + (7’) and PO+(P) the subsets of PO(T) and PO(V') respectively, where all coordinates are 
assumed to be strictly positive. This implies in particular that the corresponding edge correlations p e must be 
strictly positive. The space of tree metrics on a fixed tree T is isomorphic to PO + (T), with the isomorphism 
given by d l3 = - log (pij). 


Theorem 2. Let R = ( Pij)i,jev and suppose that pij > 0 for all i, j £ V. The following tw’o statements hold: 

(1) R £ PO(V') if and only if for every four not necessarily distinct elements i, j , k, l in V at least two out of 
three products PikPji, pupjk, PijPki are equal and less than or equal to the third. Moreover, if this holds then T 
with the property R £ PO(T) is generically identified uniquely. 

(2) For a fixed T, the space PO(T) has dimension |2£|. This is described by the following set of constraints. 
For any four distinct elements i, j , k , l ofV such that ph(*, j) IT ph(fc, l) = 0, we have that 


PikPji — PilPjk PijPki• ( 2 ) 

Moreover, for any three distinct leaves i,j , k we have the triangle inequality PijPik < pj k . 


2.2 Latent tree models and phylogenetic oranges 

We are now ready to derive the semialgebraic description of the model M(T). Let iS + (P) denote the space of 
all symmetric positive definite |P| x |P|-matrices. 
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Figure 3: Tripod tree. 


Theorem 3. Let T be a tree and let R = \p;j\ £ <S+(V) be a correlation matrix. Then R £ M(T) if and only if 
If = [|p,j |] £ PO(T) and PijPikPjk > 0 for any three distinct i,j , k £ V. 

The proof is given in the appendix. 

Example 1. Let T be the tripod tree in /Tp|j] The space of correlation matrices in M(T) is described by 

Pl 2 pl 3 P 23 > 0 , IP12P13I < \P 23 1 1 IP12P23I < \Pl 3 | j IP13P23 1 < |pl2 1 - 

If p \21 Pi 3 > P 23 > 0 then by Theorem^2) the space described by the above inequalities corresponds to PO (T). 
There are three other sign patterns for p 12 , P 13 , P 23 that ensure that P 12 P 13 P 23 > 0. For every such pattern we 
obtain a copy o/PO(T). Quite remarkably, the space of the correlation matrices in M(T ) looks exactly like the 
three-dimensional slice of the corresponding binary latent class model; see^fllman et al. ( |2075j Figure 1 ). It is 
interesting to note that such constraints cannot, in general, be neglected. For example, simple calculations show 
that the ratio of the volume of M(T) to the volume of all 3x3 correlation matrices is only —y « 0.2. 

Based on Theorem 02) and Theorem0we formulate the following result. 

Proposition 1. IfT is a fixed tree then the space M(T ) has dimension \ V\ + \E\. Let E be a covariance matrix 
with no zeros. Then E £ M(T) if and only if for any three distinct leaves i,j, k 


(3) 


(4) 


{tZkkttij tTjktJjk) (c'jjO'jk O ij Oj k) {tt H<Jj k tT-ij(T-ik) f 0 , 

and for any four distinct elements i,j , fc, l ofV such that ph(*, j) n ph(fc, l) = 0 

This full algebraic and semialgebraic description can be viewed as a generalization from star trees to general 
trees of the main results in Bekker & de Leeuw ( 1987|>; Pearl & Xu ( 1987| l. An analogous description of the 
second order moments for binary latent tree models was given in Zwiernik & Smith] ( 201 l| l. The similarity of 
both descriptions comes from the fact that the parameterization of correlations in the binary latent tree model is 
precisely 0 ; see|Zwiernik & Smith|(|2011] Lemma 4.1). 


2.3 Necessary constraints for non-Gaussian tree models 

Fix an inner node r in a tree T and direct all edges away from r to obtain a rooted tree T r . Set 

Z r = e r and Z v = X v Z u + e v for u—>vinT r , (5) 

where A„ £ R. and e v for v £ U are independent random variables with mean zero. It is known that the vector 
Z follows the Gaussian tree model on T if and only if all e v are Gaussian. Its subvector X corresponding to the 
leaf nodes will then follow the latent tree model on T. 

It is natural to ask what happens if e v are not jointly Gaussian. In a nonparametric setting we could instead 
assume that e v have distributions in the family of all univariate distributions with mean zero and finite variance. 
In this case it is easily shown that all second order moments of Z exist, = cov(Z u , Z v )/var(Z u ), and the 
correlations satisfy the parameterization in 0. In particular, we have the following result that provides a set of 
necessary constraints on the correlations of a non-Gaussian tree model. 

Theorem 4. Suppose that Z is a random vector satisfying the recursive equations in 0/or a rooted tree. If all 
e v have finite variance, then the correlation matrix of X must satisfy the constraints of the Gaussian latent tree 
model. 

So it appears that our results, whilst focused on Gaussian modes, apply to and could be extended beyond this 
setting. 
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3 Utilising semialgebraic constraints 


We now describe how the semialgebraic constraints can be used more formally to give an indication of tree- 
compatibility. Here the constraints in ([3]), which hold for every tree topology, will be called tree-compatibility 
constraints. A test based on these constraints can be used as an effective preliminary assessment tool to inform 
whether it is legitimate to proceed to a more complex tetrad analysis. For a fixed T we can further extend our test 
by including the inequality constraints in 0 The resulting constraints are called T-compatibility constraints. A 
test of fit based on such constraints is called a tree-compatibility or T-compatibility test as appropriate. 

A straightforward but effective assessment of T-compatibility constraints can be obtained from the posterior 
probabilities by applying an inverse-Wishart prior on the sample covariance. More precisely, if E is a sample 
covariance matrix based on a sample X of size n from A/" m (0, C), then the estimated scatter matrix is calcu¬ 
lated as S = ri Ti = XX T and it is well known that the scatter matrix is Wishart distributed S ~ W m (n, C ) 
(Wishart 1928). A common prior distribution for unknown covariance C is the inverse-Wishart W“ 1 (riO) Co), 


e.g. Gelman et al.|([20T3]);|Carhn & Louis (2008 ); Roverato (2002 ). The inverse-Wishart is a conjugate prior and 


so the posterior density p(C \ X) is inverse-Wishart W m (no + n, Co + S). As in 


Roverato 


< |2002 


, for Co the 

identity matrix I\y\ can be used and by letting no = m ensure that the prior density is well defined. Then C \ X 
can be sampled with each draw being translated to a correlation and then tested against the constraints. After N 
such draws from the posterior distribution an estimate of the posterior probability that C satisfies the positivity 
constraint can be obtained. Of course other choices of families of priors could be chosen instead (for example 
the scaled inverse-Wishart (O’Malley & Zaslavsky} 20081) or we could use a strategy that models correlation 
and covariance separately ( [Barnard et al. 2000| ). However, these alternatives bring additional computational cost 
and complexity. Alternatively, it may be possible to adapt the work on inequality-constrained hypotheses to this 
framework using Bayesian methods, see |Van de Schoot et al.| ( [20T2| l; |Gu et al.| ( |2014| ); |Gardner et al.| ( |2014[ ). 

In Example [I] an estimate of the probability of C satisfying the semialgebraic structure of M{T) can be 
constructed using indicator functions. For each draw l from the relevant inverse-Wishart posterior distribution 
for E, the following identity is evaluated: 


r 123(^) — 1{(<5’330'12 — ^13T23) (^"22^13 — <5'12<5'23)( < 5‘11 < 5‘23 — ^ 12 ^ 13 ) > 0} 


(6) 


where a t j ,i,j = 1, 2, 3 are the covariances corresponding to covariance draw l of the posterior, the index l being 
dropped to keep the notation clean. The posterior probability of tree-compatibility is thus estimated using: 


1 


N 


i?123(E) = ^5>i23(S) 


(7) 


;=i 


For a tree with four variables such that ph(l,2) and ph(3,4) do not intersect, the final test of the inequality 
constraints is then: 


7 ?12|34(£) — ^ 1 {o’140'23 ^ 0'12C I '34 < 0}l{(fi 3 CT24 — d"l2Ci34 < 0} 0 r\j k ( E) 


N 


N 


i=i 


1 <i<j 
<k< 4 


( 8 ) 


These sampling approaches do not extend to the algebraic constraints because the set of draws from the posterior 
satisfying an equality constraint will have zero probability. Thus an alternative approach is needed that uses 
sample distributions of the minors of a covariance matrix. 


4 The sample distribution of algebraic constraints 

From the previous section and Theorem [2|7), the signs of tetrad constraints (Jik<Jji — <Ju<Jjk and other quadratic 
binomials of the form <Ja<Jjk — cr,; ; cr,; fc provide essential information about whether a Gaussian distribution lies 
in M(T). This type of constraints can be realized as minors of the covariance matrix E, that is 

det(E , Jjfc ;). det(Ej Ji jfc), (9) 
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where 'Eij^i denotes the 2x2 sub-matrix of E with rows i and j and columns k and l. Let {™} denote the set of 
all subsets of {1,, m} of cardinality two. We now propose the following estimator of the value of det(Cj j) 
for 

Qi,j = ZJZ .— T3 det(Sj t j). (10) 


n{n — 1) 


We note from 


Drton et al. 


(2008, Corollary 4.2) that Qi } j is an unbiased estimator of det(C/ j j). 

It is convenient to introduce the follow- 


In what follows we provide the covariances between different Qj j. 
ing notation. For an to x to matrix A let A 12 ' denote the matrix with rows and columns indexed by elements { } 
whose (J, J)-th element is the corresponding minor det(.4 /./). With this notation, the matrix, whose elements 
are the estimators Qi.j, is /{n(n — 1)). 

Sadly, there is no simple explicit formula for covariances of various 2-minors. However, these can be com¬ 
puted if the true distribution C is known. From[Drton et al.|(|2008| Proposition 3.3) 


cov(S^) = {(C 1 / 2 )^ <g> (tf 1 / 2 )( 2 )}{cov(Wrt 2 >)}{(C ,1 / 2 )( 2 > ® (C 1 / 2 )^}, 


( 11 ) 


where W has standard Wishart distribution VV m (n, I) and ® is the Kronecker product. 

In the rest of this section we provide a complete description of the covariance matrix cov(H / l 2 l). Our 
discussion follows |Drton et~ak ( |2008 Example 4.6). This gives the same derivation for the case m = 4. We 
show below that the generalization to m > 4 is straightforward. 

The matrix cov(W^) has many symmetries that we want to exploit. Note that for all /, J £ {™}, 
det Wi j = det Wjj and hence 

cov{det(FF/,j),det(IT r A -, L )} = cov{det(Wjj), &et(W K , L )} = cov{det(lC A -, L ), det(W7 : j)}. 

We can therefore, without loss, consider only unordered pairs of sets (J, J), where I = {i,j} and J = {k. I } 
with i < j, k < l and either i < k or i = k and j < l. 

Let AAB = {A \ B) U (B \ A) be the symmetric difference of two sets. We split the rows and the columns 
of cov(kpl 2 l) into blocks according to the value of IAJ and KAL. With this convention, by Drton et al. 


(2008 Corollary 4.2 and Proposition 3.4), cav(W^) is a block-diagonal matrix. Therefore, it is enough to 


describe its diagonal blocks. Since |/AJ| € {0,2,4}, we have three types of blocks. We first describe the block 
corresponding to IAJ = KAL = 0, or equivalently I = J, K = L. This block forms a (”) X (”) -matrix that 
satisfies: 

( 0 1 1 n K\ = 0 

cov{det(W/ t j), det(FFic j ic)} = < 2 n(n— l) 2 1 1 (T K\ = 1 

[ 2n(2n + l)(n — 1) I = K. 

We now have ('") blocks corresponding to IAJ = KAL = {i,j} for 1 < i < j < to. Every such block 
is an (to — 2) x (to — 2)-matrix, where I = {i, fc}, J = {j, k}, K = {i, (}, L = {j , (} for some k < l £ 
{1,..., to} \ {i, j}. All of these matrices have two types of elements. The diagonal entries (k = l ) are equal to 
n(n + 2)(n— 1). The off-diagonal elements (k < l ), up to a sign, are n(n— l) 2 . The sign depends on the relative 
order of i, j , k, l. By Drton et al. ( |2008[ Theorem 4.5), the sign is positive if k < i < j < l. Now a simple sign 
analysis shows that the sign is negative only if either i < k < j < l or k < i < l < j. This yields that: 


n(n + 2 )(n — 1) 

cov{det(W ik jk),det(Wiiji)} = { -n(n - l ) 2 

n{n — l) 2 

Finally, there are (™) blocks corresponding to IAJ = { i,j,k,l }, where 1 <i<j<k<l<m. Each 
such block is a 3 x 3 matrix of the form 


k = l 

i < k < j < l or k < i < l < j 

otherwise. 


ik,jl 

il,jk 

b 

-b - 

a 

b 


a _ 


where a = 2n(n — 1 ) and b = n(n — 1 ). 
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5 Quartets and applications of tetrad analyses 


5.1 The method of quartets 

For any four distinct leaves i,j,k,l £ V we say that q l:i M = ij\kl forms a quartet of T if the paths ph(z,_j) 
and ph(fc, l) are disjoint, c.f. Figure[2] A binary tree T displays the set of quartets Q if each quartet q £ Q is a 
quartet of T. A set of quartets Q is said to determine T if T displays Q and T is the unique tree displayed by Q 
(Semple & Steel 2003) 1; the set of all quartets displayed by T is denoted by Qt- Quartets can be considered as 
fundamental components of binary trees; see Dress et al. (2012) for more details. A set Qt is said to be minimal 
if there exists no element q £ Qt such that Qt \ {<?} determines T. Griinewald et al. (2008 Theorem 2.4) 
provides the minimum size of any Qt (i.e. the size of the smallest minimal defining quartet set), which for 
a binary tree is just the number of internal edges of T. Furthermore, Semple & Steel ( j2003| Theorem 6.8.8) 
provide a quick method for constructing minimal defining sets of quartets that define binary phylogenetic trees. 

Let V cUbe such that V = {/. j, k, / }, where these elements are distinct. Consider three random variables 
Qikji, Qu,jk and Qij y ki as defined in ( fT0| ). By Theorem[ 2 ] if a tree model holds, then the mean of one of the 
three will be zero and the other two means will be equal up to sign. So these Qj j can be used to test the algebraic 
constraints in Proposition]!] 

Here we focus on testing the vanishing tetrads, i.e. testing whether the quartet q %:h u is displayed in T given 
the data. To test a particular binary tree T, a set Qt is required, i.e. a set of quartets Q that determines T. The 
number of edges of T is 2m — 3 and so the Gaussian latent tree model on T has codimension ('") — (2 m — 3). 
This means that to test a model, we need to work with quartet systems Qt of size quadratic in m. On the other 
hand if we believe that the data come from a latent tree model, then to only find the corresponding tree T we 
can work with any minimal quartet system determining T, and these are of size m — 3 (the number of internal 
edges). This makes a big difference for larger trees. 

In practice, one may wish to select Qt such that it is minimal (of size (™) — (2m — 3)), i.e. it contains 
no redundant quartets, because otherwise the covariance of minors matrix may be close to being singular; see 
Bollen & Ting ( 1993[ >. However, there may not always be an obvious reason for selecting one minimal defining 
quartet set Qt over another. In such cases one approach is to randomly select a number of sets to assess the 
robustness of the results; see Bollen & Ting (1993 i. For each q t jjd £ Qt consider the corresponding Qij.ki as 
in (10 1 and define Qt = [Qij,ki] to be the vector of these Qij^ki- We write Qj^k i for the sample means of the 
observations of Qij,ki ■ Since Qt is a consistent estimator of Qt (see Drton et al. (2007)), as the sample size n 


tends to infinity any tree T is uniquely identified by the i,j, k, l such that = 0. 

To standardize the data we use the sample covariance matrix E q t which has dimension p = \Qt\- Alterna¬ 
tively we can use its proxy E q t . This can then be obtained by recycling cov(IF' 2 )) computed in Sectionjdjand 
using (111 substituting C for the sample covariance of original variables E. The matrix E q t can be obtained 
much more efficiently than An appropriate simultaneous test statistic (H2h is provided in Bollen & Ting 

( fT993> , 

1~ = Qt'Eq t Qti ( 12 ) 

where A 4 is the transpose of A. Given T is constructed with p algebraically independent quartets, the asym ptotic 
distribution of this test statistic is ^^-distribution with p degrees of freedom. Compare (12 1 with Bollen & Ting 
(1993 (20)) where their E tt is the covariance of \JhQt- Here the sample size n is incorporated implicitly 
through E^ so (12 i provides a significance test for the equality constraints in <|4b, where the required moments 
°f Qi.j are given in Section]]] This provides a quick method for assessing whether a Gaussian data set appears 
consistent with the algebraic constraints associated with any tree model. 

In deriving the asymptotic distribution in ( [T2| we implicitly assume that the true covariance matrix is a 
sufficiently regular point of the given tree model. In practice, it is enough to assume that the true covariance 
matrix contains no zeros; see Section 5 in |Drton et al.| ( |2016b| ) and |Drton et al.| ( |2016a| l. 

Hypothesis testing for vanishing tetrads can be used for both confirmatory tetrad analysis and for exploratory 


tetrad analysis. There are many algorithms for obtaining candidate trees, for instance see Junker & Schreiber 


( |201 1[ ); [Sung| ( j2009l ) for surveys of methods. However, often there is no way to assess the suitability of the 
finally chosen tree. Confirmatory tetrad analysis takes a candidate tree and provides an absolute rather than 
relative value as to how well the data supports the purported tree. 
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In the case of a large tree it is infeasible to test all quartets at once. On the other hand, it is straightforward 
and very stable to test single quartets or a small subset of them. One advantage of this approach is that it allows 
us to identify easily certain macrostructures of the tree which may lead to more robust techniques for finding the 
underlying tree. We now illustrate confirmatory and exploratory techniques both for simulated data and some 
linguistics data sets. 


5.2 Basic simulations for the method of quartets 

In this section we provide a basic analysis of the methods discussed in the previous section. The only difference 


5.1 


The data in our simulations come from the 


from the previous applications of this method in other contex ts is that in (12 1 we explicitly replaced the sample 
covariance of the tetrads with £ q t as explained in Section 
quintet tree model 12|5|34; c.f. Figure[l] 

We first randomly choose the true covariance matrix C by sampling the edge correlations uniformly from the 
interval [1/2,1], Given this random true covariance matrix, we can now repeat the following evaluation proce¬ 
dure 10,000 times. We sample n = 60 (5 times the dimension of the model) points from the given distribution 
C. In this scenario, standard packages that might be used to find the maximum likelihood estimate, such as the 
sem package (Fox et al. 2014!, are unstable. On the other hand any set of quartets can be easily tested, and this 
does not even require any fitting of the model. Moreover, the sample distribution of the test statistic is already 
very close to the asymptotic distribution. Quite surprisingly this proximity remains true even when the sample 
size is only about twice the dimension of the model. Of course, in this case, the power of the test will be much 
lower. 

In Figure [5] we show the simulated values of test statistics of the form ( fl2| ) compared with their theoretical 
asymptotic distributions. Figure |5ja) depicts the statistic built on a single tetrad constraint for the quartet 12134. 
This constraint holds for the data generating distribution and therefore the test statistic is expected to have 
asymptotic y 2 -distribution with one degree of freedom. We see that the histogram is very close to the theoretical 
distribution. For comparison, in Figure [5Jb) we show the sample distribution of the same test statistic for the 
quartet 13|24. This constraint does not hold for the data generating distribution and we see that the sample 
distribution of the corresponding test statistic is very far from y 2 . The test statistic can be easily set up for any 
subset of quartets. In Figure [5jc) we plot the test statistic to test two quartets 12135 and 15|34. This is the 
minimal set of quartets that identifies the quintet tree 12|5|34. This means that these two particular quartet will 
not be simultaneously satisfied for any other tree model . Again, the sample distribution lies very close to the 
asymptotic distribution, which in this case is y 2 . 

In Figure |5fd) we test simultaneously a minimal set of quartets defining the quintet tree model; these are: 
12134, 12135, 15134. In this case the sample distribution of the test statistic also lies close to y 2 with a slightly 
smaller variance. The reason for that is that the true distribution is closer to a mixture of ^-distributions. As 
a result, the test based on our statistic is typically more conservative. To obtain a better understanding of its 
performance we compare it with the structural expectation-maximization algorithm Friedman et al. ( |2002) > as 
applied to Gaussian latent tree models. This algorithm tries to find the tree that gives the maximum value of 
the likelihood function. However, like the standard expectation-maximization algorithm, it often gets stuck in 
a local maximum. In our simulations we generated 100 data sets from the given quintet model. If the sample 
size n = 60, then for our particular choice of a correlation matrix with all edge correlations equal to 0.7, we 
obtained the correct tree only 68 out of 100 times. On the other hand, our tetrad method always confirms the 
correct tree on any significance level smaller than 0.1. If n = 200 then the structural expectation-maximization 
algorithm was correct 99 out of 100 times, and again our quartet method was always correct. We emphasize that 
in a less ideal situation, for example, when some edge correlations are small, or in the presence of some partial 
misspecification, the structural expectation-maximization algorithm will perform poorly because the likelihood 
function is less stable. In contrast, our computations show that the quartet method tends to be much more robust. 


5.3 Exploratory tetrad analysis example: linguistics 

Consider now the linguistic data set from Shiers et al. ( |2014[ ). This comprises phonetic functional spectrogram 
data from five Romance languages: French, Italian, Portuguese, and two forms of Spanish, namely American and 
Iberian. Acoustic data have provided new insights into language developments (e.g. Bouchard-Cote et al. (2013), 
Aston et al.l(2010|). Here the evolutionary dependencies between spoken numbers is studied with each extant 
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language treated as a leaf vertex. The high dimensional spectrogram data is projected from 8100 dimensions to 


15 dimensions using a variation of canonical variate analysis, the full details of which can be found in Shiers et al. 


( 2014| >. Each of the 15 canonical components projects the mean word data to obtain 15 new data sets referred to 
as canonical scores. Each canonical component accounts for a particular combination of phonetic variation and 
each set of canonical scores is considered independently. This gives us the flexibility to hypothesize different 
evolutionary relationships for different aspects of the speech. For each set of canonical scores a 5 x 5 covariance 
matrix is calculated between the five languages. Royston’s multivariate normality test (Royston 1983| does not 
reject Gaussianity at the 0.01 level for any of these 15 sets of scores. 

We sampled 10 5 covariance matrices from the inverse-Wishart posterior for each of the sample covariances 
Ei,..., E 15 . We then performed a tree-compatibility test with respect to the positivity constraint implied by the 
triangle inequalities in Theorem[2|2) for each canonical component. We identify four such components, the first, 
fourth, sixth, and second, with high posterior probabilities, respectively 1, 0.89, 0.77 and 0.74, which warrant 
further investigation. 

Considering the quintet tree in Fig. [T] there are 15 different labelled binary trees to test. In order to test 
a particular configuration of labels we construct a set of minimal defining quartets Q for the quintet tree as 
referenced in Section |5T| in the case of the quintet tree this smallest minimal set is two. 

For each of the four dimensions of interest, using the sampling distributions given in Section [4] and the test 
statistic © with two degrees of freedom, a p-value can be calculated for each of the 15 non-isomorphic trees 


with languages as leaves. To retain an overall significance rate of less than 0.05 a Bonferroni correction (Dunn 


1961 is applied such that the significance level is set at 0.05/15 ~ 0.0033 per test. If more than one tree is not 
rejected then the candidate tree proposed by exploratory tetrad analysis is that with the highest p-value. We find 
that multiple trees exceed the threshold for all four components. The highest p-values for the first, second, fourth 
and sixth components were 0.524, 0.960, 0.775 and 0.902 respectively relating to the candidate trees: 12|4|35, 
1315124 ,14|2|35, and 23|1|45 respectively with coding 1 = French, 2 = Italian, 3 = Portuguese, 4 = American 
Spanish, 5 = Iberian Spanish. 

For illustration we focus on the candidate tree for the second component, which is displayed in Fig [I] It is 
known from the analysis and expert interpretation in Shiers et al. ( |2014| l that this component is likely to relate 
to variation in vowel sounds, nasality, and the lip rounding of language speakers. By isolating these phonetic 
features and identifying an accompany tree that fits the data we can gain insights which may have otherwise 
been obscured. For example, from this particular analysis we could hypothesize that the differences in nasality 
of Italian and French evolved independently conditional on the common ancestor of Spanish and Portuguese. 
In combination with expert judgement, such statements can provide good starting points for further analyses of 
these features in relation to a specified tree. 


5.4 Confirmatory tetrad analysis example: biology 

We next consider a data set consisting of growth curves for five yeast species each observed in the same 96 
environments, each species with at least two replicates. The growth was recorded at approximately six minute 


intervals over a period of just over 26 hours. These species have been studied before ( |Marcet-Houben & Ga- 
baldon 2009) and a phylogeny has been suggested as in Fig. |4] However, Libkind et al. (201 1|) hypothesize that 


yeast species S. bayanus is a hybrid involving S. cerevisiae. This alternative hypothesis would violate the tree 
assumption. Previous research has indicated that for studied yeast species there is positive correlation between 
growth-related phenotypic variation and genotypic phylogenetic relationships, e.g. Liti et al. (j2009|); j Warringer 


et al. (2011 1 . Thus, this leads us to consider the yeast growth-curve data to investigate evolutionary questions. 
We carried out a confirmatory tetrad analysis to assess whether the proposed tree structure in Marcet-Houben & 
Gabaldon ( 2009[ ) was reflected in any aspects of the growth data. 

To pre-process the data, a smoothed cubic spline basis was fitted to each growth vector resulting in a set 
of functional data objects which were then regularly evaluated to obtain comparable discretized representations. 
Mean vectors were then calculated for each species and environment and then these were standardized to remove 
mean environmental effects. We then performed a principal component analysis across species to identify the 
core variability of the growth curves. Note that the first four components account for over 99% of variability. 
More detailed analysis, not reported here, can help interpret these components. For example, the first component 
relates only to growth variation in hours 10 to 26, whereas the second component relates to growth variation 
peaking at 12 hours with opposite growth variation from 18 hours onwards. 
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For each of the mean species projections in these dimensions, the sample covariance matrix was constructed. 
As a first step, the inverse-Wishart approach specified in 0 was implemented. Recall that the tripod constraints 
are tree-compatibility constraints and thus require no tailoring to a specific T. Hence, these can be utilized very 
simply to narrow the list of components to test as part of a confirmatory tetrad analysis. The tree-compatibility 
for the first four components were 31%, 2%, 18%, and 3% respectively. Thus, we consider the first and third 
components worth investigating further via confirmatory tetrad analysis for '/'-compatibility. 


K. waltii 


S. kudriavzevii 


S. bayanus 



S. mikitae 


cerevisiae 


Figure 4: Quintet tree X 5 of yeast species (Marcet-Houben & Gabaldon 2009|>. 


The results of the confirmatory tetrad analysis for '/^-compatibility (see Fig. |4|> gave p-values of 0.721 and 
0.955 for the first and third components respectively. To double check these results we repeated the test using the 
bootstrapping strategy outlined in Bollen & Stine ( |1992| . The results were very similar with p-values of 0.729 
and 0.921 respectively. The confirmatory tetrad analysis and inverse-Wishart simulation results both gave upper 
bounds on T 5 -compatibility, but on balance we concluded that the first and third components were T 5 -compatible. 
Therefore, the class of Gaussian latent tree models did appear suitable for modelling some aspects of these yeast 
species’ growth curves. However, for features relating to components 2 and 4, there is some evidence to support 


the exploration of a wider model class that could accommodate the hybrid hypothesis described in Libkind et al. 

d20TTT >. 


6 Discussion 

Understanding the complete description of the correlation space associated with Gaussian latent tree models 
opens up a number of useful tools for assessing tree-compatibility either on a class basis or for a specified 
tree. Some of the methods described in this paper are particularly useful as part of an exploratory analysis for 
defining the relevant model search space, whereas others are ideal as a final step to check the conclusions of a 
model search. The complete semialgebraic structure of the correlation space has not been utilized elsewhere for 
assessing tree-compatibility of data, though the positivity constraint has been used previously, see Shiers et al. 

( 2Q14| >. Incorporating a prior such as the inverse-Wishart and sampling from the posterior distribution allows for 
probabilistic conclusions about the model. This provides a more nuanced answer than a simple assessment of 
inequalities via the plugging in of covariance point estimates, and enables two or more incompatible but plausible 
trees to be compared. 

One important practical consideration is the scalability of these methods. Techniques employing the semi¬ 
algebraic constraints can be adapted to larger number of variables reasonably well. For a confirmatory tetrad 
analysis the biggest computational cost is the calculation of the covariance of minors, which for p observed 
random variables has dimension of order p 4 which can become prohibitive. For example, if 8 GB of RAM is 
allocated for a single matrix, the limit of p is approximately 25 even if redundant rows and columns are removed 
from the matrix. However, much larger p can be considered by calculating the relevant statistics for each quartet 
marginally. Then the covariance matrix of minors has dimension of only 36 and the memory can be released 
once each quartet has been tested. In either case, the final memory requirement could further be reduced with 
smart programming taking advantage of symmetries and sparseness. In a similar vein, to extend the scope of 
exploratory tetrad analysis to a greater number of variables, one strategy is to only assess single quartets in the 
first instance and use these results to reduce the set of possible trees worth considering. Given the effectiveness 
of the quartet testing as demonstrated with even small sample sizes, this approach seems sensible and to have 
significant advantage over methods that require a whole model to be tested at once. 
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A Proofs 

of Theorem^ Assume first that all correlations p l7 are strictly positive, that is R £ PO + (U) or R £ PO+(T). 
We use the fact that PO + (V) is isomorphic to the space of tree metrics, whose constraints are given in Theorem[l] 
and Corollary [T| Translating these constraints via dij = — log {pij) gives exactly the constraints in the proposed 
theorem. These constraints describe a closed set, which is the smallest closed set containing PO+(U). So it is 
enough to show that the closure of l ’0 + (7’) is equal to PO(T). This follows from the fact that PO(T) is a toric 
cube and, by Engstrom et al.|( |2012[ Theorem 1), every toric cube is equal to the closure of its interior. □ 

of Theorem^ If R € M(T) then each p. l3 has representation (jTJ). Thus \pij\ = TleephOj) | Pel and hence R! 
also lies in PO(T). To show that pijPikPjk > 0 consider the induced subtree over i,j , k , that is, the smallest 
connected subgraph of T containing vertices i,j , k. This subtree necessarily has a unique vertex v that lies on 
the intersection of paths ph (ij), ph(ifc) and ph (jk). Moreover, by Q, 

pijpikPjk = n pe n * n ^= n ^ n ^ n a ~ °- 

e(E.ph(ij) eGph (ik) eGph (jk) eEph(iu) e£ph(jv) eEph(fcu) 

To prove the reverse implication, we note that every correlation matrix in PO(T) has (after permuting rows and 
columns) a block diagonal structure with strictly positive elements in each block. Consider first the case when 
all elements of R are non-zero, that is, R! has strictly positive entries. Distinguish one node in V and label it as 
1. Let D be a diagonal matrix such that Da = —1 if pu < 0 and D it = 1 if pu > 0. If R £ M(T) then also 
DRD lies in M(T ) because M{T) is invariant with respect to all diagonal transformations. Moreover, it holds 
that R' = DRD because DuDupu = |pij| for all* £ V \ {1} and DaDjjPij = \pij \ for i,j £ V \ {1}. This 
last equality follows from our assumption that pupijPij > 0 so that the sign of pupij is equal to the sign of 
Pij. Now, since R' £ PO(T) C M(T) and R = DR'D we also have that R £ M(T). The analysis of the case 
when R is block diagonal will be omitted. □ 
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(c) two tetrads 


(d) defining tetrads 


Figure 5: Illustration of the simulations in Section 5.2 Sixty observations are generated from a random matrix 
in the tree model for the quintet tree 12|5|34 and the corresponding test statistic is computed. This procedure is 
iterated 10,000 times. In each figure we compare the sample distribution of a test statistics against its theoretical 
distribution. In (a) we test a single tetrad 12134. In (b) we test a single false tetrad 13124. In (c) we test two 
tetrads 12135, 15|34, and in (d) we test a minimal set of quartets defining the true quintet tree. The solid lines are 
densities of xf , Xi> xi> an d X§ respectively. 
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